Protein Identification from Tandem Mass Spectra with Probabilistic Language Modelling
Abstract
This paper presents an interdisciplinary investigation of statistical information retrieval (IR) techniques for protein identification from tandem mass spectra, a challenging problem in proteomic data analysis. We formulate the task as an IR problem by constructing a “query vector” whose elements are system-predicted peptides with confidence scores derived from spectrum analysis of the input sample, and by defining the vector space of “documents” as protein profiles, each constructed from the theoretical spectrum of a protein. This formulation establishes a new connection between the protein identification problem and both probabilistic language modeling and the vector space models of IR, and it enables us to compare fundamental differences between the IR models and common approaches in protein identification. Our experiments on benchmark spectrometry query sets and large protein databases demonstrate that the IR models significantly outperform well-established methods in protein identification, in particular by enhancing precision in high-recall regions, where the conventional approaches are weak.
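The formulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the peptide strings, protein names, profile weights, the choice of cosine similarity for the vector space model, and Jelinek-Mercer smoothing (with a hypothetical mixing weight `lam`) for the language model are all assumptions made for illustration only.

```python
import math
from collections import Counter

def cosine_score(query, profile):
    """Vector space model: cosine between the query vector and a protein profile."""
    dot = sum(w * profile.get(pep, 0.0) for pep, w in query.items())
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    p_norm = math.sqrt(sum(v * v for v in profile.values()))
    return dot / (q_norm * p_norm) if q_norm and p_norm else 0.0

def query_log_likelihood(query, profile, background, lam=0.5):
    """Language model: confidence-weighted log-likelihood of the query peptides
    under a protein's peptide distribution, smoothed with a collection-wide
    background model (Jelinek-Mercer smoothing; `lam` is an assumed value)."""
    total = sum(profile.values())
    score = 0.0
    for pep, weight in query.items():
        p_doc = profile.get(pep, 0.0) / total if total else 0.0
        p = (1.0 - lam) * p_doc + lam * background.get(pep, 0.0)
        if p > 0.0:  # peptides unseen in the whole collection are skipped
            score += weight * math.log(p)
    return score

def rank(query, profiles, scorer):
    """Return protein names sorted from best to worst score."""
    return sorted(profiles, key=lambda name: scorer(query, profiles[name]), reverse=True)

# Hypothetical toy data: each "document" is a protein profile over theoretical peptides.
profiles = {
    "PROT_A": {"PEPTIDEK": 1.0, "SAMPLER": 1.0, "LINKK": 1.0},
    "PROT_B": {"SAMPLER": 1.0, "OTHERK": 1.0},
    "PROT_C": {"UNRELATEDK": 1.0},
}

# Background model: relative frequency of each peptide across all profiles.
counts = Counter()
for prof in profiles.values():
    counts.update(prof)
grand_total = sum(counts.values())
background = {pep: c / grand_total for pep, c in counts.items()}

# Hypothetical "query vector": predicted peptides with confidence scores.
query = {"PEPTIDEK": 0.9, "SAMPLER": 0.6}

vsm_ranking = rank(query, profiles, cosine_score)
lm_ranking = rank(query, profiles,
                  lambda q, p: query_log_likelihood(q, p, background))
print(vsm_ranking)
print(lm_ranking)
```

In this toy setting both scorers place the protein that contains both predicted peptides first; on real data the two models can rank proteins differently, which is the kind of comparison the abstract refers to.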
Related resources
ProbID: a probabilistic algorithm to identify peptides through sequence database searching using tandem mass spectral data.
With the recent quick expansion of DNA and protein sequence databases, intensive efforts are underway to interpret the linear genetic information of DNA in terms of function, structure, and control of biological processes. The systematic identification and quantification of expressed proteins has proven particularly powerful in this regard. Large-scale protein identification is usually achieved...
SCOPE: a probabilistic model for scoring tandem mass spectra against a peptide database
Proteomics, or the direct analysis of the expressed protein components of a cell, is critical to our understanding of cellular biological processes in normal and diseased tissue. A key requirement for its success is the ability to identify proteins in complex mixtures. Recent technological advances in tandem mass spectrometry have made it the method of choice for high-throughput identification o...
Preprocessing of tandem mass spectra using machine learning methods
Protein identification has been more helpful than before in the diagnosis and treatment of many diseases, such as cancer, heart disease and HIV. Tandem mass spectrometry is a powerful tool for protein identification. In a typical experiment, proteins are broken into small amino acid oligomers called peptides. By determining the amino acid sequence of several peptides of a protein, its whole ami...
DeltAMT: a statistical algorithm for fast detection of protein modifications from LC-MS/MS data.
Identification of proteins and their modifications via liquid chromatography-tandem mass spectrometry is an important task for the field of proteomics. However, because of the complexity of tandem mass spectra, the majority of the spectra cannot be identified. The presence of unanticipated protein modifications is among the major reasons for the low spectral identification rate. The conventiona...
Publication date: 2009